Abstract: A new very large scale integration (VLSI)
algorithm for a $2^N$-length discrete Hartley transform (DHT) that
can be efficiently implemented on a highly modular and parallel
VLSI architecture having a regular structure is presented. The
DHT algorithm can be efficiently split into several parallel parts
that can be executed concurrently. Moreover, the proposed
algorithm is well suited to the subexpression sharing technique,
which can be used to significantly reduce the hardware complexity
of the highly parallel VLSI implementation. By exploiting the
advantages of the proposed algorithm and the fact that
multipliers with the same constant can be efficiently shared, the
number of multipliers has been significantly reduced, so
that it is very small compared with that
of the existing algorithms. Moreover, multipliers with a
constant can be efficiently implemented in VLSI.

Index Terms: Discrete Hartley transform (DHT), DHT
domain processing, fast algorithms.


I. INTRODUCTION 

The discrete Fourier transform (DFT) is used in many
digital signal processing applications, such as signal and image
compression, filter banks [1], signal
representation, and harmonic analysis [2]. The discrete Hartley
transform (DHT) [2], [3] can efficiently replace the
DFT when the input sequence is real. The literature contains
several fast algorithms for the computation of the DHT [4]-[7]
and some algorithms for the computation of the generalized DHT
[8]-[10].

There are also several split-radix algorithms for
computing the DHT with a low arithmetic cost, such as those
proposed by Sorensen et al. [11] and Malvar [12]. Bi [13] proposed another
split-radix algorithm in which the odd-indexed transform
outputs are computed using an indirect method. The classical
split-radix algorithm is difficult to implement in VLSI because of
its irregular computational structure and because
the butterflies differ significantly from stage to stage. Thus, it
is necessary to derive new algorithms of this kind that are suited to a
parallel VLSI system.

The literature also contains several fast
algorithms that use a recursive strategy, such as that in [14] for
discrete cosine transforms (DCT) and that in [10] for the
generalized DHT. Since the DHT is computationally intensive, it
is necessary to derive dedicated hardware implementations
using VLSI technology.

One category of VLSI implementations is
represented by systolic arrays. There are many systolic array
implementations of the DHT [15]-[18]. Systolic array
architectures are modular and regular, but they rely
mainly on pipelining rather than parallel processing to obtain
high-speed processing.

Highly parallel solutions, such as those in
[8] and [19], have also been proposed. In [8], a highly parallel and
modular solution for the implementation of the type-III DHT
based on a new VLSI algorithm is proposed. In [19], a
highly parallel solution for the implementation of the DHT
based on a direct implementation of the fast Hartley transform
(FHT) is presented. It is worth noting that hardware implementations of
the FHT are rare.

Multipliers in a VLSI structure consume a large
portion of the chip area and introduce significant delays. This
is why memory-based solutions for implementing
multipliers have been increasingly used in the literature
[15], [20]-[24]. To implement multipliers efficiently with
lookup-table-based solutions, one operand
must be a constant. When one of the operands is constant, it is
possible to store all the partial results in a ROM, and the
number of memory words is significantly reduced from $2^{2L}$ to
$2^{L}$, where $L$ is the operand word length.
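
The memory saving can be illustrated with a minimal Python sketch, assuming unsigned L-bit operands. The word length L = 8, the example constant 181, and the names `build_rom` and `rom_multiply` are illustrative assumptions and are not taken from [15], [20]-[24]. With a fixed constant, the ROM is addressed by the single variable operand and holds $2^{L}$ words, whereas a table addressed by both operands would hold $2^{2L}$ words.

```python
L = 8  # assumed operand word length in bits (illustrative)

def build_rom(constant, word_length=L):
    """Precompute constant * x for every unsigned word_length-bit value x."""
    return [constant * x for x in range(1 << word_length)]

def rom_multiply(rom, x):
    """Multiply x by the fixed constant using a single table lookup."""
    return rom[x]

rom = build_rom(181)           # 181 is an arbitrary example constant
words_constant = len(rom)      # 2**L words when one operand is fixed
words_general = 1 << (2 * L)   # 2**(2L) words if both operands address the table
```

The techniques cited above reduce the table further (for example by splitting the operand); this sketch only shows the basic one-operand lookup.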

In this brief, a new VLSI DHT algorithm that is well
suited to implementation on a highly parallel and
modular architecture is proposed. It can be used to design
a completely novel VLSI architecture for the DHT. Moreover,
by using the subexpression sharing technique [25] and sharing the
multipliers with the same constant, the hardware complexity
can be significantly reduced, the number of multipliers being
very small, significantly less than that in [8]. The proposed
solution uses only multipliers with a constant, which
can be efficiently implemented in VLSI. The proposed
solution is appealing not only for its high level of parallelism
and its modular and regular structure but also because
a small hardware complexity can be obtained by extensively
sharing the common blocks.

The rest of this brief is organized as follows. In
Section II, we present a new algorithm for computing an
N-point DHT. In Section III, we present an algorithm for a
small-length DHT. In Section IV, we analyze the arithmetic
cost, and in Section V, we present an example of our
algorithm. In Section VI, we present the new VLSI
architecture. The conclusion is presented in Section VII.


II. NEW VLSI ALGORITHM FOR DHT

Let N > 4 be a power of two. For any real input
sequence {x(i) : i = 0, 1, ..., N-1}, the DHT(N) is defined by

$$X(k) = \mathrm{DHT}(N)\{x(i)\} = \sum_{i=0}^{N-1} x(i)\,\mathrm{cas}\!\left[\frac{2ki\pi}{N}\right], \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where cas(x) = cos(x) + sin(x).
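
As a point of reference, definition (1) can be evaluated directly in O(N^2) operations. The following minimal Python sketch does so; the function names `cas` and `dht_direct` are illustrative and not part of the proposed algorithm. It is intended only as a baseline against which a fast algorithm can be checked.

```python
import math

def cas(x):
    """cas(x) = cos(x) + sin(x), the kernel of the DHT."""
    return math.cos(x) + math.sin(x)

def dht_direct(x):
    """Evaluate definition (1) term by term for a real sequence x."""
    n = len(x)
    return [sum(x[i] * cas(2.0 * math.pi * k * i / n) for i in range(n))
            for k in range(n)]
```

Because the DHT is its own inverse up to a factor of N, applying `dht_direct` twice and dividing by N should return the input, which gives a simple self-check.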

We can compute an N-length DHT using a new
algorithm given by the following relations:

$$X_N(k)\{x(i)\} = X_{N/2}(k)\{x(2i)\} + u(0)\sin(2k\pi/N) + \left[X_{N/2}(k)\{u(i)\} - u(0)/2\right] 2\cos(2k\pi/N) \qquad (2)$$

$$X_N(N/2+k)\{x(i)\} = X_{N/2}(k)\{x(2i)\} - u(0)\sin(2k\pi/N) - \left[X_{N/2}(k)\{u(i)\} - u(0)/2\right] 2\cos(2k\pi/N) \qquad (3)$$

for k = 0, 1, ..., N/4, and

$$X_N(N/2-k)\{x(i)\} = X_{N/2}(N/2-k)\{x(2i)\} + u(0)\sin(2k\pi/N) - \left[X_{N/2}(N/2-k)\{u(i)\} - u(0)/2\right] 2\cos(2k\pi/N) \qquad (4)$$

$$X_N(N-k)\{x(i)\} = X_{N/2}(N/2-k)\{x(2i)\} - u(0)\sin(2k\pi/N) + \left[X_{N/2}(N/2-k)\{u(i)\} - u(0)/2\right] 2\cos(2k\pi/N) \qquad (5)$$

for k = 0, 1, ..., N/4, where

$$X_{N/2}(k)\{x(2i)\} = \sum_{i=0}^{N/2-1} x(2i)\,\mathrm{cas}\!\left[\frac{2ki\pi}{N/2}\right] \qquad (6)$$

$$X_{N/2}(k)\{u(i)\} = \sum_{i=0}^{N/2-1} u(i)\,\mathrm{cas}\!\left[\frac{2ki\pi}{N/2}\right] \qquad (7)$$

are DHTs of length N/2, with {u(i) : i = 0, 1, ..., (N/2)-1} an
auxiliary input sequence given by

$$u(N/2-1) = x(N-1) \qquad (8)$$

$$u(i) = x(2i+1) - u(i+1) \quad \text{for } i = (N/2)-2, \ldots, 1, 0 \qquad (9)$$

The computation of (2)-(5) requires an extra
7N/4 additions and N/2 multiplications if the
multipliers with the same constant are shared. The computation of the
auxiliary input sequence using (8) and (9) requires an
extra N/2-1 additions.

The obtained algorithm can be used as a VLSI algorithm
in which the number of multipliers can be significantly reduced
by sharing the multipliers with the same constant and by
using subexpression sharing techniques, as shown in
Section VI.


TABLE I

COMPUTATIONAL COMPLEXITY


III. ALGORITHM FOR A SMALL DHT

An efficient implementation of a fast DHT algorithm
depends closely on an efficient algorithm for a small DHT.
We present here an efficient DHT algorithm for length N = 8:

X(0) = [(x(0)+x(4))+(x(2)+x(6))+(x(1)+x(5))+(x(3)+x(7))]
X(2) = [(x(0)+x(4))-(x(2)+x(6))+(x(1)+x(5))-(x(3)+x(7))]
X(4) = [(x(0)+x(4))+(x(2)+x(6))-(x(1)+x(5))-(x(3)+x(7))]
X(6) = [(x(0)+x(4))-(x(2)+x(6))-(x(1)+x(5))+(x(3)+x(7))]
X(1) = [(x(0)-x(4))+(x(2)-x(6))+c(x(1)-x(5))]
X(3) = [(x(0)-x(4))-(x(2)-x(6))+c(x(3)-x(7))]
X(5) = [(x(0)-x(4))+(x(2)-x(6))-c(x(1)-x(5))]
X(7) = [(x(0)-x(4))-(x(2)-x(6))-c(x(3)-x(7))]

with $c = \sqrt{2}$.

We have $M_{\mathrm{DHT}}(8) = 2$ and $A_{\mathrm{DHT}}(8) = 16$, with the notation defined in
Section IV. Because both multiplications use the
same constant c, the same multiplier can be shared, thus
further reducing the number of multipliers.
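
The decomposition of Section II, terminated by the length-8 algorithm above, can be checked numerically. The following Python sketch is a software illustration under stated assumptions, not the hardware algorithm itself. It evaluates (2) for every output index m by using the periodicity of the half-length DHTs, so that (3)-(5) appear as special cases. The function names are illustrative, and the length-8 routine uses the signs that follow from definition (1).

```python
import math

SQRT2 = math.sqrt(2.0)

def dht_direct(x):
    """Reference DHT evaluated directly from definition (1)."""
    n = len(x)
    return [sum(x[i] * (math.cos(2 * math.pi * k * i / n)
                        + math.sin(2 * math.pi * k * i / n))
                for i in range(n)) for k in range(n)]

def dht8(x):
    """Length-8 DHT using two multiplications by sqrt(2)."""
    a0, a1 = x[0] + x[4], x[0] - x[4]
    b0, b1 = x[2] + x[6], x[2] - x[6]
    c0, c1 = x[1] + x[5], x[1] - x[5]
    d0, d1 = x[3] + x[7], x[3] - x[7]
    p, q = SQRT2 * c1, SQRT2 * d1
    return [(a0 + b0) + (c0 + d0), (a1 + b1) + p,
            (a0 - b0) + (c0 - d0), (a1 - b1) + q,
            (a0 + b0) - (c0 + d0), (a1 + b1) - p,
            (a0 - b0) - (c0 - d0), (a1 - b1) - q]

def aux_sequence(x):
    """Auxiliary sequence of (8) and (9)."""
    half = len(x) // 2
    u = [0.0] * half
    u[half - 1] = x[-1]
    for i in range(half - 2, -1, -1):
        u[i] = x[2 * i + 1] - u[i + 1]
    return u

def dht_recursive(x):
    """DHT of length N = 8 * 2**j via the recursion of Section II."""
    n = len(x)
    if n == 8:
        return dht8(x)
    half = n // 2
    u = aux_sequence(x)
    even = dht_recursive(x[0::2])   # X_{N/2}(k){x(2i)}
    odd = dht_recursive(u)          # X_{N/2}(k){u(i)}
    out = []
    for m in range(n):
        theta = 2.0 * math.pi * m / n
        k = m % half                # half-length DHTs are periodic in k
        out.append(even[k] + u[0] * math.sin(theta)
                   + (odd[k] - u[0] / 2.0) * 2.0 * math.cos(theta))
    return out
```

Comparing `dht_recursive(x)` with `dht_direct(x)` for a few lengths (8, 16, 32, 64) is a quick way to confirm that the signs in the recursion and in the length-8 kernel are consistent with (1).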

IV. ARITHMETIC COST 

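Let $A_{\mathrm{DHT}}(N)$ and $M_{\mathrm{DHT}}(N)$ denote the numbers
of additions and multiplications, respectively, required to compute DHT(N). We
have

$$M_{\mathrm{DHT}}(N) = 2M_{\mathrm{DHT}}(N/2) + \tfrac{1}{2}N \qquad (10)$$

$$A_{\mathrm{DHT}}(N) = 2A_{\mathrm{DHT}}(N/2) + \tfrac{9}{4}N - 1 \qquad (11)$$

where $M_{\mathrm{DHT}}(8) = 2$ and $A_{\mathrm{DHT}}(8) = 16$. Solving the recursions
(10) and (11), we obtain

$$M_{\mathrm{DHT}}(N) = \tfrac{N}{4}\left(2\log_2 N - 5\right) \qquad (12)$$

$$A_{\mathrm{DHT}}(N) = \tfrac{9}{4}N\log_2 N - \tfrac{39}{8}N + 1 \qquad (13)$$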
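
The recursions (10) and (11) can be unrolled numerically and compared with closed-form solutions. In the Python sketch below, the closed forms (N/4)(2 log2 N - 5) and (9/4) N log2 N - (39/8) N + 1 are those obtained by solving the recursions with the initial values M(8) = 2 and A(8) = 16 stated in the text; they are assumptions of this sketch, and the function names are illustrative.

```python
def mults(n):
    """Recursion (10) with M(8) = 2."""
    return 2 if n == 8 else 2 * mults(n // 2) + n // 2

def adds(n):
    """Recursion (11) with A(8) = 16; (9/4)N is an integer for N >= 16."""
    return 16 if n == 8 else 2 * adds(n // 2) + (9 * n) // 4 - 1

def mults_closed(n):
    """Closed form (N/4)(2 log2 N - 5) for N a power of two, N >= 8."""
    log2n = n.bit_length() - 1
    return n * (2 * log2n - 5) // 4

def adds_closed(n):
    """Closed form (9/4) N log2 N - (39/8) N + 1 for N a power of two, N >= 8."""
    log2n = n.bit_length() - 1
    return (18 * n * log2n - 39 * n) // 8 + 1
```

For N = 32 both forms give 40 multiplications, the value quoted for Table I later in the text, and 205 additions.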

Table I lists the required number of multiplications
and additions for the proposed algorithm, the Sorensen algorithm [11],
and the Bi algorithm, where rotations are implemented with four
multiplications and two additions (Radix-2 [13]*) or with
three multiplications and three additions (Radix-2 [13]**).

The values of M for the proposed algorithm are computed
assuming that the multipliers with the same constant are
shared. The number of multipliers in the Sorensen algorithm [11]
is significantly greater than that in the proposed one. The
number of multipliers for the Bi algorithm is greater than that of
our algorithm when rotations are
implemented with four multiplications and two additions, and
slightly smaller when the rotations are
implemented with three multiplications and three additions.
However, the split-radix algorithm has an irregular structure
and is difficult to implement in hardware, whereas
our algorithm has a regular and modular structure and can
be very easily implemented in parallel, as shown in
Section VI for a DHT of length N = 32. Moreover, the number
of multipliers in the proposed implementation can be
further reduced significantly by sharing multiplications, as
shown in Section VI.

V. EXAMPLE OF THE PROPOSED ALGORITHM

We shall illustrate the main features of the proposed
algorithm by considering a DHT of length N = 32.

A. DHT of Length N = 32

We first compute recursively the auxiliary input
sequences

$$u^{(1)}(15) = x(31) \qquad (14)$$

$$u^{(1)}(i) = x(2i+1) - u^{(1)}(i+1) \quad \text{for } i = 0, 1, \ldots, 14 \qquad (15)$$

$$v^{(1)}(7) = x(30) \qquad (16)$$

$$v^{(1)}(i) = x(4i+2) - v^{(1)}(i+1) \quad \text{for } i = 0, 1, \ldots, 6 \qquad (17)$$

$$u^{(2)}(7) = u^{(1)}(15) \qquad (18)$$

$$u^{(2)}(i) = u^{(1)}(2i+1) - u^{(2)}(i+1) \quad \text{for } i = 0, 1, \ldots, 6 \qquad (19)$$
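
The three auxiliary sequences are instances of the same backward recursion (8)-(9), applied to x(i), to the even-indexed samples x(2i), and to the first auxiliary sequence. A minimal Python sketch under that reading follows; the helper name `aux_sequence`, the variable names `u1`, `v1`, `u2`, and the test input are illustrative assumptions.

```python
def aux_sequence(x):
    """Backward recursion: u(M-1) = x(2M-1), u(i) = x(2i+1) - u(i+1)."""
    half = len(x) // 2
    u = [0.0] * half
    u[half - 1] = x[-1]
    for i in range(half - 2, -1, -1):   # computed from the last index down to 0
        u[i] = x[2 * i + 1] - u[i + 1]
    return u

x = [float(i) for i in range(32)]   # arbitrary test input of length 32
u1 = aux_sequence(x)                # first-level sequence built from x(2i+1)
v1 = aux_sequence(x[0::2])          # same recursion on x(2i), i.e. built from x(4i+2)
u2 = aux_sequence(u1)               # second-level sequence built from u1(2i+1)
```

Each recursion is a chain of subtractions, so the three sequences have lengths 16, 8, and 8, and `v1` and `u1` can be computed concurrently because they depend only on the input samples.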

Then, we have to compute in parallel (21)-(28). We define another module as

$$U_8(k)\{x_a(i)\} = X_8(k)\{x_a(i)\} - x_a(0)/2 \qquad (20)$$

For $X_{32}(k)$, we can write the following relations:

$$\begin{aligned} X_{32}(k)\{x(i)\} ={}& \left[U_8(k)\{x(4i)\} + U_8(k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 + v^{(1)}(0)\sin(2k\pi/16) + u^{(1)}(0)\sin(2k\pi/32) \\ &+ \left[U_8(k)\{u^{(1)}(2i)\} + U_8(k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\cos(2k\pi/32) + u^{(2)}(0)\,2\sin(2k\pi/16)\cos(2k\pi/32) \qquad (21) \end{aligned}$$

$$\begin{aligned} X_{32}(k+8)\{x(i)\} ={}& \left[U_8(k)\{x(4i)\} - U_8(k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 - v^{(1)}(0)\sin(2k\pi/16) + u^{(1)}(0)\cos(2k\pi/32) \\ &- \left[U_8(k)\{u^{(1)}(2i)\} - U_8(k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\sin(2k\pi/32) + u^{(2)}(0)\,2\sin(2k\pi/16)\sin(2k\pi/32) \qquad (22) \end{aligned}$$

$$\begin{aligned} X_{32}(16+k)\{x(i)\} ={}& \left[U_8(k)\{x(4i)\} + U_8(k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 + v^{(1)}(0)\sin(2k\pi/16) - u^{(1)}(0)\sin(2k\pi/32) \\ &- \left[U_8(k)\{u^{(1)}(2i)\} + U_8(k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\cos(2k\pi/32) - u^{(2)}(0)\,2\sin(2k\pi/16)\cos(2k\pi/32) \qquad (23) \end{aligned}$$

$$\begin{aligned} X_{32}(k+24)\{x(i)\} ={}& \left[U_8(k)\{x(4i)\} - U_8(k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 - v^{(1)}(0)\sin(2k\pi/16) - u^{(1)}(0)\cos(2k\pi/32) \\ &+ \left[U_8(k)\{u^{(1)}(2i)\} - U_8(k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\sin(2k\pi/32) - u^{(2)}(0)\,2\sin(2k\pi/16)\sin(2k\pi/32) \qquad (24) \end{aligned}$$

for k = 0, 1, ..., 3, and

$$\begin{aligned} X_{32}(8-k)\{x(i)\} ={}& \left[U_8(8-k)\{x(4i)\} - U_8(8-k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 + v^{(1)}(0)\sin(2k\pi/16) + u^{(1)}(0)\cos(2k\pi/32) \\ &+ \left[U_8(8-k)\{u^{(1)}(2i)\} - U_8(8-k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\sin(2k\pi/32) + u^{(2)}(0)\,2\sin(2k\pi/16)\sin(2k\pi/32) \qquad (25) \end{aligned}$$

$$\begin{aligned} X_{32}(16-k)\{x(i)\} ={}& \left[U_8(8-k)\{x(4i)\} + U_8(8-k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 - v^{(1)}(0)\sin(2k\pi/16) + u^{(1)}(0)\sin(2k\pi/32) \\ &- \left[U_8(8-k)\{u^{(1)}(2i)\} + U_8(8-k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\cos(2k\pi/32) + u^{(2)}(0)\,2\sin(2k\pi/16)\cos(2k\pi/32) \qquad (26) \end{aligned}$$

$$\begin{aligned} X_{32}(24-k)\{x(i)\} ={}& \left[U_8(8-k)\{x(4i)\} - U_8(8-k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 + v^{(1)}(0)\sin(2k\pi/16) - u^{(1)}(0)\cos(2k\pi/32) \\ &- \left[U_8(8-k)\{u^{(1)}(2i)\} - U_8(8-k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\sin(2k\pi/32) - u^{(2)}(0)\,2\sin(2k\pi/16)\sin(2k\pi/32) \qquad (27) \end{aligned}$$

$$\begin{aligned} X_{32}(32-k)\{x(i)\} ={}& \left[U_8(8-k)\{x(4i)\} + U_8(8-k)\{v^{(1)}(i)\}\,2\cos(2k\pi/16)\right] + x(0)/2 - v^{(1)}(0)\sin(2k\pi/16) - u^{(1)}(0)\sin(2k\pi/32) \\ &+ \left[U_8(8-k)\{u^{(1)}(2i)\} + U_8(8-k)\{u^{(2)}(i)\}\,2\cos(2k\pi/16)\right] 2\cos(2k\pi/32) - u^{(2)}(0)\,2\sin(2k\pi/16)\cos(2k\pi/32) \qquad (28) \end{aligned}$$

for k = 0, 1, ..., 4.

These equations have been obtained by a further
reformulation of the equations obtained directly from (2)-(5)
in such a way that the technique of subexpression
sharing [18] and the sharing of multipliers with the
same constant can be used extensively. Thus, the number of multipliers has been
reduced to only 16, significantly lower
than the theoretical value of 40 from Table I, which is
obtained using (2)-(5) without the aforementioned
techniques. As can be seen, the proposed VLSI algorithm has a
very good potential for using hardware sharing techniques,
and many subexpressions are used in common. We
can thus significantly reduce the hardware complexity of the
VLSI implementation. Moreover, because the same
constant is used in several multiplications, we can use the
technique of sharing the multipliers with the same constant.
Since only multiplications with a constant are needed, these
multipliers can be implemented efficiently in VLSI.


VI. HIGHLY PARALLEL VLSI ARCHITECTURE


To clearly illustrate the features and
advantages of the proposed algorithm, the VLSI architecture
for a DHT of length N = 32 is presented in Fig. 1(a) and (b). The
proposed architecture is highly parallel
and has a modular and regular structure formed of only
a few blocks: U, MUL, ADD/SUB, and XCH, plus a few
additional adders/subtracters. The "U" blocks implement
(20), the XCH blocks interchange values and are
implemented in hardware simply by appropriate wiring, and the MUL
blocks implement the shared multipliers with a
constant. Such a block contains four multipliers with a constant.
Each multiplier is shared by four input sequences that are
multiplied by the same constant in an interleaved manner
using multiplexers and demultiplexers controlled by two
clocks. One advantage of this algorithm and
architecture is that the multiplications with the same
constant are shared in the MUL blocks. Thus, the number of
multipliers is reduced from the value of 40 given in
Table I to only 16. The final values Y(k)
of Section A and Y0(k) of Section B are added to
obtain the output sequence Y(k) using an additional adder not
shown in Fig. 1 for simplicity.

The proposed architecture has a high throughput of
32 samples per clock and can be pipelined. It is highly parallel
while using a low hardware complexity structure. The multipliers
with a constant in the MUL blocks can be efficiently implemented
in hardware using the techniques proposed in [20]-[24].
Parallel processing is one of the major ways to reduce power
consumption, since the high processing speed can be traded for
low power by reducing the supply voltage
[26]. The required control structure is very simple, which is
another important advantage.

Fig. 1. (a) VLSI architecture for DHT of length N = 32 (Section A).

(b) VLSI architecture for DHT of length N = 32 (Section B).
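
As a numerical sanity check of the length-32 decomposition that this architecture implements, the eight output groups can be evaluated in software and compared with a direct DHT. The Python sketch below writes out the relations obtained by applying (2) twice together with the module of (20). It is an illustration under stated assumptions, not a model of the hardware: the U blocks are computed with a direct length-8 DHT for brevity, the names `u8` and `dht32_parallel` are hypothetical, and the index 8-k is taken modulo 8, which relies on the periodicity of the length-8 DHT.

```python
import math

def dht_direct(x):
    """Reference DHT evaluated directly from definition (1)."""
    n = len(x)
    return [sum(x[i] * (math.cos(2 * math.pi * k * i / n)
                        + math.sin(2 * math.pi * k * i / n))
                for i in range(n)) for k in range(n)]

def aux_sequence(x):
    """Auxiliary sequence: u(M-1) = x(2M-1), u(i) = x(2i+1) - u(i+1)."""
    half = len(x) // 2
    u = [0.0] * half
    u[half - 1] = x[-1]
    for i in range(half - 2, -1, -1):
        u[i] = x[2 * i + 1] - u[i + 1]
    return u

def u8(seq):
    """Module (20): length-8 DHT minus half of the first input sample."""
    return [val - seq[0] / 2.0 for val in dht_direct(seq)]

def dht32_parallel(x):
    """Length-32 DHT from four U blocks and the eight output relations."""
    u1, v1 = aux_sequence(x), aux_sequence(x[0::2])
    u2 = aux_sequence(u1)
    A, B, C, D = u8(x[0::4]), u8(v1), u8(u1[0::2]), u8(u2)
    h, v0, p0, q0 = x[0] / 2.0, v1[0], u1[0], u2[0]
    X = [0.0] * 32
    for k in range(5):
        c16 = 2.0 * math.cos(2 * math.pi * k / 16)
        s16 = math.sin(2 * math.pi * k / 16)
        c32 = math.cos(2 * math.pi * k / 32)
        s32 = math.sin(2 * math.pi * k / 32)
        if k < 4:  # outputs k, k+8, k+16, k+24
            a, b, c, d = A[k], B[k], C[k], D[k]
            X[k] = (a + b*c16) + h + v0*s16 + p0*s32 + (c + d*c16)*2*c32 + q0*2*s16*c32
            X[k + 8] = (a - b*c16) + h - v0*s16 + p0*c32 - (c - d*c16)*2*s32 + q0*2*s16*s32
            X[k + 16] = (a + b*c16) + h + v0*s16 - p0*s32 - (c + d*c16)*2*c32 - q0*2*s16*c32
            X[k + 24] = (a - b*c16) + h - v0*s16 - p0*c32 + (c - d*c16)*2*s32 - q0*2*s16*s32
        j = (8 - k) % 8  # outputs 8-k, 16-k, 24-k, 32-k
        a, b, c, d = A[j], B[j], C[j], D[j]
        X[8 - k] = (a - b*c16) + h + v0*s16 + p0*c32 + (c - d*c16)*2*s32 + q0*2*s16*s32
        X[16 - k] = (a + b*c16) + h - v0*s16 + p0*s32 - (c + d*c16)*2*c32 + q0*2*s16*c32
        X[24 - k] = (a - b*c16) + h + v0*s16 - p0*c32 - (c - d*c16)*2*s32 - q0*2*s16*s32
        X[(32 - k) % 32] = (a + b*c16) + h - v0*s16 - p0*s32 + (c + d*c16)*2*c32 - q0*2*s16*c32
    return X
```

In this form, the bracketed terms and the products with 2cos(2kπ/16) recur across the eight relations for a given k, which is the kind of commonality that subexpression sharing and shared constant multipliers exploit.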

VII. CONCLUSION 

In this brief, a new highly parallel VLSI algorithm
for the computation of a length-$N = 2^n$ DHT having a modular
and regular structure has been presented. Moreover, this
algorithm can be implemented on a highly parallel
architecture having a modular and regular structure with a low
hardware complexity by extensively using the subexpression
sharing technique and the sharing of multipliers having the
same constant.